Initial Idea
The other day I was thinking about a simple population model. Start with $N$ people, then select one of them uniformly at random and give them a child. Now there are $N + 1$ people in the population, and one of the original $N$ has one offspring. I became interested in the distribution of the number of children an individual has after this process is repeated $T$ times. Initially I wanted to run some simulations to get some numbers. The intuitive simulation treats this as a graph: each person is a node, a new node is introduced at each iteration, and we connect it to the node that was selected to have a child. At the end of the $T$ runs, we can iterate through the nodes and count their edges to get a distribution. But before I implemented it, a better approach came to mind. Because the number of iterations is known beforehand, we can just run this in an array of size $N + T$. Then at each iteration $i$, we select an index between $0$ and $N + i - 1$ and increment it by one. Thinking about the problem as an array made it much clearer how to work this out mathematically, and saved my computer the hard work of running these simulations.
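For what it's worth, the array version is only a few lines. Here's a rough sketch of what I had in mind (the function name and signature are my own choices):

```python
import random

def simulate_children(N, T, seed=None):
    """Simulate T birth events starting from N individuals.

    counts[i] holds the number of children of individual i; newborns
    are appended to the array as the population grows.
    """
    rng = random.Random(seed)
    counts = [0] * N
    for _ in range(T):
        parent = rng.randrange(len(counts))  # uniform over current population
        counts[parent] += 1
        counts.append(0)  # the newborn starts with no children
    return counts
```

After $T$ steps the array has length $N + T$ and its entries sum to $T$, since every birth event adds exactly one child somewhere.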
Players Get Chose
The probability of selecting individual $i$ at time $t$ to have an offspring can be written with the expression below. \(P[i, t] = \frac{\mathbf 1_{ i < N + t}}{N + t - 1}\)
where $\mathbf 1_{S}$ denotes the indicator function of the event $S$, a.k.a. a fancy conditional. This is all good, but the distribution we are after is the probability that individual $i$ has $k$ kids at time $t$. Notice that in order to have $k$ kids at time $t$, you either already had $k$ kids and weren't selected, or you had $k - 1$ kids and were selected. This lets us write the distribution recursively. If we denote the number of children individual $i$ has at time $t$ by $X_i^t$, the probability distribution can be written as shown below.
\[P[X_i^t = k] = P[X_i^{t-1} = k - 1]P[i, t] + P[X_i^{t-1} = k] \left( 1 - P[i, t]\right)\]Notice that, because everyone in the population starts with no offspring, $P[X_i^0 = 0] = 1$.
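The recursion translates directly into a short dynamic program. Here's a minimal sketch for one of the original $N$ individuals, so it can be selected at every step (the function name and signature are mine):

```python
def distribution(N, t):
    """P[X^t = k] for k = 0..t, for one of the original N individuals."""
    dist = {0: 1.0}  # at t = 0, zero kids with probability 1
    for s in range(1, t + 1):
        p = 1.0 / (N + s - 1)  # P[i, s]: chance of being picked at step s
        new = {}
        for k in range(s + 1):
            # had k-1 kids and was picked, or had k kids and wasn't
            new[k] = dist.get(k - 1, 0.0) * p + dist.get(k, 0.0) * (1 - p)
        dist = new
    return dist
```

Each step touches at most $t$ entries, which is where the $O((N + t)^2)$ cost of the recursive approach mentioned below comes from.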
Mathematical Analysis
Although we could just compute the probability distribution from the recursive formula to study this process, it turns out that some properties of the distribution are fairly easy to compute. (Spoiler alert) Although these values don't have closed forms, some cool things pop up and lead to some neat approximations.
Since the first $N$ individuals all have the same properties, we will just analyze the distribution of the first individual, $X^t$. Let $Y_t = X^t - X^{t-1}$ and notice that it is either 1 or 0: either individual one gets chosen to have a child, or it doesn't. That makes $Y_t$ a Bernoulli random variable with success probability $p_t = \frac{1}{N + t - 1}$, with $Y_0 = 0$. With this in mind, we can get the mean of $X^t$ fairly easily.
Mean
\[E\left[ X^t \right] = E\left[ \sum_{i = 1}^t Y_i \right] = \sum_{i = 1}^t p_i = \sum_{i = 1}^t\frac{1}{ N + i - 1}\]This is essentially the difference of two partial sums of the harmonic series. From a computational perspective, that means we can compute the mean in $O(t)$; without it, we would have had to do $O((N + t)^2)$ operations to build the probability distribution recursively and then compute the mean for our given time. Furthermore, if we don't mind a small error or are working with large $t$ values, we can get a constant-time approximation. To do that, we write the mean as the difference of two partial harmonic sums and use the fact that $H(x) \approx \ln{x} + \gamma$, where $\gamma$ is the Euler–Mascheroni constant, approximately 0.57721. So we can efficiently compute the mean with the formula below.
\[E\left[ X^t \right] = H(N + t - 1) - H(N - 1) \approx \ln{\left[N + t - 1\right]} + \gamma - \ln{\left[N - 1\right]} - \gamma\] \[E\left[ X^t \right] \approx \ln{\left[ \frac{N + t - 1}{N - 1} \right]}\]
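Both versions are one-liners. A sketch, with names of my own choosing:

```python
import math

def mean_exact(N, t):
    # H(N + t - 1) - H(N - 1), computed directly in O(t)
    return sum(1.0 / (N + i - 1) for i in range(1, t + 1))

def mean_approx(N, t):
    # O(1) approximation via H(x) ~ ln(x) + gamma; the gammas cancel
    return math.log((N + t - 1) / (N - 1))
```

The approximation error comes from the $O(1/x)$ correction terms of the harmonic-number expansion, so it shrinks as $N$ grows.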
Variance
Following a similar analysis, and using the fact that the selections at different time steps are independent, we get the variance below.
\[Var\left[ X^t \right] = Var\left[ \sum_{i = 1}^t Y_i \right] = \sum_{i = 1}^t Var\left[ Y_i \right] = \sum_{i = 1}^t p_i \left( 1 - p_i \right) = \sum_{i = 1}^t\frac{1}{ N + i - 1} - \sum_{i = 1}^t\frac{1}{(N + i - 1)^2}\]Notice that the first term is the same as the mean, and the second term looks like a partial sum of the reciprocals of squares. As with the mean, we can write it as a difference of two such partial sums.
\[\sum_{i = 1}^t\frac{1}{(N + i - 1)^2} = \sum_{i = 1}^{N + t - 1}\frac{1}{i^2} - \sum_{i = 1}^{N - 1}\frac{1}{i^2}\]Anyone familiar with the Basel problem knows that for large $N + t$ we can approximate the first term in the right-hand side with $\pi^2 / 6$. Putting all of this together we get the following $O(N)$ approximation for computing the variance.
\[Var\left[ X^t \right] \approx \ln{\left[ \frac{N + t - 1}{N - 1} \right]} + \sum_{i = 1}^{N - 1}\frac{1}{i^2} - \frac{\pi^2}{6}\]
Distribution
Unfortunately, there isn't really any nice(ish) way to get the distribution. Because the process can be written as a sum of independent variables, I thought that maybe the way to go was to use a probability-generating function, but that doesn't really leave us with anything nice. The best we can do is the equality shown below, which comes down to a bunch of nasty product and chain rules.
\[P\left[X^t = k \right] = \frac{1}{k!} \frac{d^k}{(dz)^k} \left[ \prod_{i=1}^t \left(\frac{z + N + i - 2}{N + i - 1}\right) \right]_{z=0}\]Interestingly, this distribution that came from my stupid model for population dynamics has a name. It's called the Poisson binomial distribution, and hopefully it's used to model more realistic processes than the crap I came up with. An upside is that the Wikipedia page points to a library called poibin that implements some relatively new and efficient methods for computing this distribution.
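That said, since the probability-generating function is just a degree-$t$ polynomial in $z$ whose coefficient on $z^k$ is exactly $P[X^t = k]$, we can sidestep the derivatives and multiply the factors out numerically. A sketch (the function name and the use of numpy's `convolve` for polynomial multiplication are my choices, not anything from poibin):

```python
import numpy as np

def poisson_binomial_pmf(N, t):
    """Coefficients of the PGF prod_i ((1 - p_i) + p_i z), p_i = 1/(N+i-1).

    Returns an array where coeffs[k] = P[X^t = k] for one of the
    original N individuals.
    """
    coeffs = np.array([1.0])  # the constant polynomial 1, lowest degree first
    for i in range(1, t + 1):
        p = 1.0 / (N + i - 1)
        # multiplying polynomials is convolving their coefficient vectors
        coeffs = np.convolve(coeffs, [1 - p, p])
    return coeffs
```

Each factor is a length-2 coefficient vector, so the repeated convolution costs $O(t^2)$ overall, in exchange for the entire pmf at once.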
Distribution of Some Individuals
Useful Functions For Simulations
import warnings
warnings.filterwarnings('ignore')

from typing import Dict, Tuple


class ProbabilityDistribution(dict):
    """Tracks P[X_i^t = k] for one individual, updated one step at a time."""

    def __init__(self, individual: int, N: int):
        super().__init__()
        self[0] = 1.0
        self._individual = individual
        self._N = N
        self.t = 0
        # Time at which this individual enters the population
        self._my_t0 = max(self._individual - self._N, 0)

    def __getitem__(self, item) -> float:
        return self.get(item, 0.0)

    def _update_dist(self, new_dist) -> None:
        for k, p in new_dist.items():
            self[k] = p

    def next_step(self) -> None:
        self.t += 1
        if self.t <= self._my_t0:
            return  # this individual is not in the population yet
        max_kids = self.t - self._my_t0
        p = 1 / (self._N + self.t - 1)  # P[i, t]: selection probability at step t
        new_dict = {}
        for k in range(max_kids + 1):
            # had k-1 kids and was picked, or had k kids and wasn't
            new_dict[k] = self[k - 1] * p + self[k] * (1 - p)
        self._update_dist(new_dict)


def individuals_to_track(individuals: Tuple, initial_pop: int) -> Dict[int, ProbabilityDistribution]:
    return {i: ProbabilityDistribution(i, initial_pop) for i in individuals}
Animation
%matplotlib inline
import matplotlib
matplotlib.rcParams['animation.embed_limit'] = 2**128
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

# Track a few individuals of interest
simulation_length = 1500
initial_population_size = 100
individuals_of_interest = (1, 101, 300, 1000)
individuals = individuals_to_track(individuals_of_interest, initial_population_size)

fig = plt.figure(figsize=(16, 8))
ax = fig.add_subplot(autoscale_on=False, xlim=(0, 50), ylim=(0., 1.))
ax.set_title(f"t=0, n={initial_population_size}")
ax.set_xlabel("Kids")
ax.set_ylabel("Probability")
ax.grid()

for i, p in individuals.items():
    x = list(range(1))
    y = [p[k] for k in x]
    ax.plot(x, y, label=i)

def animate(t):
    ax.clear()
    for i, p in individuals.items():
        p.next_step()
        x = list(range(t + 1))
        y = [p[k] for k in x]
        ax.plot(x, y, label=i)
    ax.set_title(f"t={t+1}, n={initial_population_size}")
    ax.set_xlim((0, 50))
    ax.legend()

fps = 10
anim = FuncAnimation(fig, animate, simulation_length + 1, interval=1000 / fps, blit=False)
HTML(anim.to_jshtml())